We propose a new kind of embedding for natural language text that deeply represents semantic meaning. Standard text embeddings use the vector output of a pretrained language model. In our approach, we let a language model learn from the text, and then literally pick its brain, taking the actual weights of the model's neurons to generate a vector. We call this representation of the text a neural embedding. The technique may generalize beyond text and language models, but we first explore its properties for natural language processing. We compare neural embeddings with GPT sentence (SGPT) embeddings on several datasets. We observe that neural embeddings achieve comparable performance with a much smaller model, and that the errors are different.
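The abstract leaves the extraction recipe open (which weights are read out, how the model learns from the text), so the following is only a minimal sketch of the idea under assumed choices: briefly fine-tune a small causal language model on the text, then take the resulting weight change of one transformer block as the text's vector.

```python
# Minimal sketch of a "neural embedding": let a language model learn from a
# text, then read out the change in (a slice of) its weights as the vector.
# The model, number of update steps, and which block to read are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "distilgpt2"  # assumed small model; the paper's model may differ

def neural_embedding(text: str, steps: int = 3, lr: float = 1e-4) -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    # Snapshot the pretrained weights of the block we will read out, so the
    # embedding reflects what the text changed, not the shared initialization.
    block = model.transformer.h[-1]  # assumption: read the last block
    before = torch.cat([p.detach().flatten().clone() for p in block.parameters()])
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    batch = tok(text, return_tensors="pt", truncation=True)
    model.train()
    for _ in range(steps):  # let the model "learn from the text"
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    after = torch.cat([p.detach().flatten() for p in block.parameters()])
    return after - before  # the weight change serves as the text's vector

a = neural_embedding("The cat sat on the mat.")
b = neural_embedding("A kitten rested on the rug.")
print(torch.nn.functional.cosine_similarity(a, b, dim=0).item())
```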
Factual consistency is one of the important summary evaluation dimensions, especially as summary generation becomes more fluent and coherent. The ESTIME measure, recently proposed specifically for factual consistency, achieves high correlations with human expert scores, but is in principle restricted to evaluating text-summary pairs that have high lexical overlap. This is not a problem for current styles of summarization, but it may become an obstacle for future summarization systems, or for evaluating arbitrary claims against a text. In this work we generalize the method, making it applicable to any text-summary pair. Since ESTIME uses points of contextual similarity, it provides insight into the usefulness of the information taken from different BERT layers. We observe that useful information exists in almost all layers except the several lowest ones. For consistency and fluency - well-known quality dimensions - the most useful layers are close to the top (but not at the top); for coherence and relevance we found a more complicated and interesting picture.
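The generalized method is not spelled out in the abstract, but the core ESTIME signal can be sketched: embed the tokens of summary and text with a chosen BERT layer, align each summary token to its most contextually similar text token, and count alignments where the tokens themselves disagree. The layer choice and the simple mismatch rule below are illustrative simplifications, not the paper's exact procedure.

```python
# Rough sketch of an ESTIME-like inconsistency count from points of
# contextual similarity. Layer 9 is an arbitrary mid-to-upper choice.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def token_states(text, layer):
    enc = tok(text, return_tensors="pt", truncation=True, add_special_tokens=False)
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (seq_len, dim)
    return enc["input_ids"][0], hidden

def estime_like_mismatches(summary, text, layer=9):
    s_ids, s_h = token_states(summary, layer)
    t_ids, t_h = token_states(text, layer)
    sims = torch.nn.functional.normalize(s_h, dim=1) @ \
           torch.nn.functional.normalize(t_h, dim=1).T
    nearest = sims.argmax(dim=1)  # most contextually similar text token
    return int((s_ids != t_ids[nearest]).sum())  # count token mismatches

print(estime_like_mismatches("The cat sat on a mat.",
                             "A cat was sitting on the mat in the hall."))
```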
We present Namesakes, a dataset of entities with ambiguous names obtained from English Wikipedia and news articles. It consists of 58,862 mentions of 4,148 unique entities and their namesakes: 1,000 mentions from news, 28,843 from Wikipedia articles about the entities, and 29,019 mentions from Wikipedia backlinks. Namesakes should help establish a challenging benchmark for the task of named entity linking (NEL).
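The abstract gives the dataset's composition but not its schema. Purely as a hypothetical illustration, a Namesakes-style mention record and a trivial exact-match NEL scorer might look like the following; all field names are invented.

```python
# Hypothetical record layout for an ambiguous-name mention and a minimal
# accuracy metric for a named entity linking system.
from dataclasses import dataclass

@dataclass
class Mention:
    surface: str    # the ambiguous name as it appears in text
    context: str    # surrounding sentence or paragraph
    source: str     # "news" | "wiki_article" | "wiki_backlink"
    entity_id: str  # gold entity the mention refers to

def nel_accuracy(mentions, predict):
    """Fraction of mentions that `predict` links to the correct entity."""
    correct = sum(predict(m) == m.entity_id for m in mentions)
    return correct / len(mentions)

sample = [Mention("Washington", "... traveled to Washington for talks ...",
                  "news", "Q61")]
print(nel_accuracy(sample, lambda m: "Q61"))
```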
The creation of a quality summarization dataset is an expensive, time-consuming effort, requiring the production and evaluation of summaries by both trained humans and machines. If this effort has been made in one language, it is beneficial to be able to use the dataset in other languages without repeating the human annotation. To investigate how much we can trust machine translation of such a dataset, we translate the English dataset SummEval into seven languages and compare performance across automatic evaluation measures. We explore equivalence testing as the appropriate statistical paradigm for evaluating correlations between human and automated scoring of summaries. While we find some potential for dataset reuse in languages similar to the source, most summary evaluation methods are not found to be statistically equivalent across translations.
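One plausible instantiation of the equivalence-testing paradigm mentioned above is the two one-sided tests (TOST) procedure on Fisher-z-transformed correlations: rather than asking whether two correlations differ, it asks whether their difference is provably smaller than a margin. The margin value and the Fisher-z approach below are illustrative assumptions, not the paper's exact protocol.

```python
# TOST equivalence test for two correlation coefficients (e.g. the same
# metric's correlation with human scores in English vs. in a translation).
import numpy as np
from scipy import stats

def tost_correlations(r1, n1, r2, n2, margin=0.1):
    """Two one-sided tests: is |rho1 - rho2| smaller than `margin`?"""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)        # Fisher z transform
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of z1 - z2
    diff = z1 - z2
    delta = np.arctanh(margin)  # margin mapped to the z scale (approximation)
    p_lower = 1 - stats.norm.cdf((diff + delta) / se)  # H0: diff <= -delta
    p_upper = stats.norm.cdf((diff - delta) / se)      # H0: diff >= +delta
    # Equivalence is claimed only if BOTH one-sided tests reject,
    # so the overall p-value is the larger of the two.
    return max(p_lower, p_upper)

print(tost_correlations(r1=0.45, n1=100, r2=0.41, n2=100))
```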
The goal of a summary is to concisely state the most important information in a document. With this principle in mind, we introduce new reference-free summary evaluation metrics that use a pretrained language model to estimate the information content shared between a document and its summary. These metrics are a modernization of the Shannon Game, a method for scoring summary quality proposed decades ago, in which we replace the human annotators with a language model. We also view these metrics as an extension of BLANC, a recently proposed method of measuring summary quality based on the performance of a language model with and without the help of the summary. Using transformer-based language models, we empirically verify that our metrics achieve state-of-the-art correlation with human judgement for the coherence and relevance dimensions of summary quality, as well as competitive correlation with human judgement for consistency and fluency.
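A minimal sketch of the Shannon-Game idea: score the document's log-likelihood under a language model with and without the summary as context, and take the difference as an estimate of the information the summary shares with the document. The model choice and the plain-concatenation prompting below are assumptions for illustration, not the paper's metrics.

```python
# Difference in document log-likelihood with vs. without the summary as
# context, as a crude shared-information estimate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # assumed model
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def doc_logprob(document, context=""):
    ctx_ids = tok(context, return_tensors="pt")["input_ids"] if context else None
    doc_ids = tok(document, return_tensors="pt")["input_ids"]
    ids = torch.cat([ctx_ids, doc_ids], dim=1) if ctx_ids is not None else doc_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    token_logp = logp[torch.arange(len(targets)), targets]
    n_ctx = 0 if ctx_ids is None else ctx_ids.shape[1]
    # Score only the document tokens, not the summary used as context.
    return token_logp[max(n_ctx - 1, 0):].sum().item()

def shannon_score(document, summary):
    return doc_logprob(document, context=summary) - doc_logprob(document)

print(shannon_score("The storm closed three highways overnight.",
                    "A storm shut down highways."))
```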
Correct scoring of a driver's risk is of great significance to auto insurance companies. While the current tools used in this field have been proven in practice to be quite efficient and beneficial, we argue that there is still a lot of room for development and improvement in the auto insurance risk estimation process. To this end, we develop a framework based on a combination of a neural network together with a dimensionality reduction technique t-SNE (t-distributed stochastic neighbour embedding). This enables us to visually represent the complex structure of the risk as a two-dimensional surface, while still preserving the properties of the local region in the feature space. The obtained results, which are based on real insurance data, reveal a clear contrast between the high- and low-risk policy holders, and indeed improve upon the actual risk estimation performed by the insurer. Due to the visual accessibility of the portfolio in this approach, we argue that this framework could be advantageous to the auto insurer, both as a main risk prediction tool and as an additional validation stage in other approaches.
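A sketch of the pipeline this describes: fit a small neural network to predict risk, then project its hidden representations to two dimensions with t-SNE and color the points by predicted risk. Synthetic data stands in for the real insurance data, and the network size and t-SNE settings are illustrative assumptions.

```python
# Visualize a policy portfolio as a 2-D risk surface: NN for risk, t-SNE on
# the network's hidden representations for the projection.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))  # stand-in policy features (age, mileage, ...)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
net.fit(X, y)

# Hidden representation: first-layer ReLU activations, computed manually.
hidden = np.maximum(X @ net.coefs_[0] + net.intercepts_[0], 0)
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(hidden)

plt.scatter(xy[:, 0], xy[:, 1], c=net.predict_proba(X)[:, 1],
            cmap="coolwarm", s=8)
plt.colorbar(label="predicted risk")
plt.title("Portfolio as a 2-D risk surface (sketch)")
plt.show()
```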
With recent advances in CNNs, exceptional improvements have been made in semantic segmentation of high-resolution images in terms of accuracy and latency. However, challenges still remain in detecting objects in crowded scenes, large scale variations, partial occlusion, and distortions, while still maintaining mobility and low latency. We introduce a fast and efficient convolutional neural network, ASBU-Net, for semantic segmentation of high-resolution images that addresses these problems and uses no novel layer types, for ease of quantization and embedded hardware support. ASBU-Net is based on a new feature extraction module, the atrous space bender layer (ASBL), which is efficient in terms of computation and memory. The ASBL modules form the building block used to construct ASBU-Net. Since the network does not use any special layers, it can be easily implemented, quantized, and deployed on FPGAs and other hardware with limited memory. We present experiments on resource-accuracy trade-offs and show strong performance compared to other popular models.
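The abstract does not define the ASBL's internals, so the block below is only a guess at its flavor: a lightweight dilated (atrous) convolution block built purely from standard, quantization-friendly operations, in keeping with the stated constraint of using no custom layer types. It is not the actual ASBL.

```python
# Illustrative standard-ops-only atrous block (residual dilated conv).
import torch
import torch.nn as nn

class AtrousBlock(nn.Module):
    """Dilated conv block from standard layers only (not the real ASBL)."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection keeps blocks stackable

x = torch.randn(1, 64, 128, 128)
block = AtrousBlock(channels=64, dilation=4)
print(block(x).shape)  # torch.Size([1, 64, 128, 128])
```

Because every operation here has standard quantized kernels, such a block maps directly onto FPGA and embedded toolchains, which appears to be the design motivation.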
MD4 and MD5 are seminal cryptographic hash functions proposed in the early 1990s. MD4 consists of 48 steps and produces a 128-bit hash given a message of arbitrary finite size. MD5 is a more secure 64-step extension of MD4. Both MD4 and MD5 are vulnerable to practical collision attacks, yet it is still not realistic to invert them, i.e. to find a message given a hash. In 2007, the 39-step version of MD4 was inverted by reducing to SAT and applying a CDCL solver along with the so-called Dobbertin constraints. As for MD5, in 2012 its 28-step version was inverted via a CDCL solver for one specified hash without adding any additional constraints. In this study, Cube-and-Conquer (a combination of CDCL and lookahead) is applied to invert step-reduced versions of MD4 and MD5. For this purpose, two algorithms are proposed. The first one generates inversion problems for MD4 by gradually modifying the Dobbertin constraints. The second algorithm tries the cubing phase of Cube-and-Conquer with different cutoff thresholds to find the one with the minimal runtime estimate for the conquer phase. This algorithm operates in two modes: (i) estimating the hardness of an arbitrary given formula; (ii) incomplete SAT solving of a given satisfiable formula. While the first algorithm is focused on inverting step-reduced MD4, the second one is not area-specific and is thus applicable to a variety of classes of hard SAT instances. In this study, for the first time, 40-, 41-, 42-, and 43-step MD4 are inverted via the first algorithm and the estimating mode of the second algorithm. 28-step MD5 is inverted for four hashes via the incomplete SAT-solving mode of the second algorithm; for three of these hashes this is done for the first time.
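A schematic of the second algorithm's estimating mode, as described above: for each candidate cutoff threshold, run the cubing (lookahead) phase, solve a random sample of the resulting cubes with a CDCL solver, and extrapolate the mean runtime to all cubes; keep the cutoff with the smallest estimate. `run_cubing` and `solve_cube` are hypothetical wrappers around an external cuber and CDCL solver; the sampling-based estimate follows the general idea, not the paper's exact procedure.

```python
# Cutoff-threshold search for Cube-and-Conquer via sampled runtime estimates.
import random
import time

def estimate_conquer_runtime(cnf, cutoff, run_cubing, solve_cube, sample=20):
    cubes = run_cubing(cnf, cutoff)  # lookahead phase: split formula into cubes
    sampled = random.sample(cubes, min(sample, len(cubes)))
    start = time.monotonic()
    for cube in sampled:             # conquer phase (CDCL) on a random sample
        solve_cube(cnf, cube)
    mean = (time.monotonic() - start) / len(sampled)
    return mean * len(cubes)         # extrapolate mean runtime to all cubes

def best_cutoff(cnf, cutoffs, run_cubing, solve_cube):
    estimates = {n: estimate_conquer_runtime(cnf, n, run_cubing, solve_cube)
                 for n in cutoffs}
    return min(estimates, key=estimates.get), estimates
```

In the incomplete SAT-solving mode, the same sampling loop would simply stop as soon as one sampled cube turns out satisfiable, since any satisfying cube yields a preimage.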
Accurate mapping of forests is critical for forest management and carbon stock monitoring. Deep learning is becoming more popular in Earth Observation (EO); however, the availability of reference data limits its potential in wide-area forest mapping. To overcome these limitations, here we introduce contrastive regression into EO-based forest mapping and develop a novel semi-supervised regression framework for wall-to-wall mapping of continuous forest variables. It combines a supervised contrastive regression loss and a semi-supervised Cross-Pseudo Regression loss. The framework is demonstrated over a boreal forest site using Copernicus Sentinel-1 and Sentinel-2 imagery for mapping forest tree height. The achieved prediction accuracies are substantially better than those of a vanilla UNet or traditional regression models, with a relative RMSE of 15.1% at stand level. We expect that the developed framework can be used for modeling other forest variables and with other EO datasets.
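The abstract names but does not define the supervised contrastive regression loss; in the contrastive-regression literature, one common form pulls together embeddings of samples with similar target values and pushes apart dissimilar ones, weighting positives by label proximity. The sketch below follows that spirit; the Gaussian kernel, temperature, and weighting are illustrative assumptions, not the paper's exact loss.

```python
# One plausible supervised contrastive regression loss: label-distance-
# weighted InfoNCE over a batch of embeddings and continuous targets.
import torch
import torch.nn.functional as F

def contrastive_regression_loss(emb, targets, temperature=0.1, sigma=2.0):
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.T / temperature                      # pairwise similarity
    label_dist = (targets[:, None] - targets[None, :]).abs()
    # Samples with closer target values act as stronger positives.
    weights = torch.exp(-label_dist ** 2 / (2 * sigma ** 2))
    eye = torch.eye(len(emb), dtype=torch.bool)
    weights = weights.masked_fill(eye, 0.0)              # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, -1e9),
                                     dim=1, keepdim=True)
    loss = -(weights * log_prob).sum(dim=1) / weights.sum(dim=1).clamp_min(1e-8)
    return loss.mean()

emb = torch.randn(8, 32, requires_grad=True)  # e.g. pixel/stand embeddings
heights = torch.rand(8) * 30.0                # stand-in tree heights, meters
print(contrastive_regression_loss(emb, heights))
```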
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
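Since the models are openly released, BLOOM can be loaded through the Hugging Face hub. A minimal generation sketch follows; "bigscience/bloom-560m" is one of the smaller released checkpoints (the full 176B model requires multi-GPU serving).

```python
# Load a small BLOOM checkpoint and generate greedily from a prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

prompt = "Translate to French: The weather is nice today.\n"
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```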